Data Cleaning

The dataset contains 12,936 medicines, each with 9 descriptors. According to the FDA, most available medicines can be classified into 40 categories.

Exploring different uses/categories of medicines

Checking the average price of medicines by usage

Aim 1: To cluster/label medicines according to the 40 categories listed by the FDA.

Aim 2: To build a model that predicts the price of a medicine from its details.

Features to be used: activeIngredient, Name, Price, Dosage Form, therapeutic_class, Mechanism of Action, Uses, benefits, description, product info. The Description, Product Info, and Benefits columns are concatenated into a single combined text field.
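A minimal sketch of the concatenation step, assuming the data sits in a pandas DataFrame. The frame name `medicines`, the toy rows, and the output column `combined_description` are illustrative placeholders, not names confirmed by the notebook.

```python
import pandas as pd

# Toy stand-in for the medicines table; real column names may differ.
medicines = pd.DataFrame({
    "Name": ["Med A", "Med B"],
    "Description": ["Relieves pain.", None],
    "Product Info": ["Tablet, 500 mg.", "Syrup."],
    "Benefits": ["Reduces fever.", "Soothes cough."],
})

text_columns = ["Description", "Product Info", "Benefits"]

# Treat missing text as empty so one gap does not break the join,
# then join the three columns with single spaces.
medicines["combined_description"] = (
    medicines[text_columns]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.strip()
)
```

Filling missing values before joining keeps rows that lack one of the three fields instead of turning the whole combined text into a missing value.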

We can now work with the medicines_final dataset.

Labeling Medicines according to FDA

Since we do not have the correct labels for the medicines, we will have to use unsupervised learning methods.

AIM 1: Classifying medicines according to categories listed by the FDA.

Method 1: Measuring the cosine similarity between the embeddings of each medicine's concatenated product information, description, and benefits and the embeddings of the Drug Categories.
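A minimal sketch of this nearest-category assignment, using NumPy only. The function names and the toy 3-dimensional vectors are hypothetical, and the code assumes no embedding is the zero vector.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def assign_categories(medicine_vecs, category_vecs):
    """Index of the most similar category for each medicine."""
    return cosine_similarity_matrix(medicine_vecs, category_vecs).argmax(axis=1)

# Toy embeddings: 2 medicines, 3 categories.
medicine_vecs = np.array([[1.0, 0.1, 0.0], [0.0, 2.0, 0.2]])
category_vecs = np.array([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

labels = assign_categories(medicine_vecs, category_vecs)
```

Because cosine similarity ignores vector length, a long description and a short one that point in the same direction receive the same category.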

Method 2: Performing K-Means clustering (K=40) on the concatenated embeddings of Active Ingredients, Therapeutic Class, and Combined Description for each medicine, which gives vectors of dimension 600. Then generating average embeddings of 'Types of Uses' for each cluster and calculating their cosine similarity with the Drug Category embeddings.
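The clustering step might look like the sketch below, which uses scikit-learn's `KMeans`. The random arrays stand in for real embeddings, and the 200 dimensions per embedding are an assumption chosen so that three of them concatenate to 600.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_medicines = 200  # small toy sample, not the full dataset

# Random stand-ins for the three per-medicine embeddings
# (200 dimensions each is an assumption).
ingredient_vecs = rng.normal(size=(n_medicines, 200))
class_vecs = rng.normal(size=(n_medicines, 200))
description_vecs = rng.normal(size=(n_medicines, 200))

# Concatenate side by side: one 600-dimensional vector per medicine.
features = np.hstack([ingredient_vecs, class_vecs, description_vecs])

kmeans = KMeans(n_clusters=40, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)
```

Fixing `random_state` makes the cluster assignment reproducible between runs, which matters when the cluster ids are later mapped to categories.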

Method 3: Performing topic modeling with Latent Dirichlet Allocation (LDA) on the combined text from product information, description, and benefits. For each combined document, identifying the topic with the highest probability, extracting that topic's top 20 words, generating their embeddings, and building a cosine similarity matrix with the embeddings of the Drug Categories.

AIM 2: Finding substitute medicines for each medicine.

Method: Generating embeddings for the combined textual description of each product and comparing its cosine similarity with that of every other medicine in the dataset.
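A minimal sketch of the substitute lookup, assuming one embedding per medicine stored as a row of a NumPy array. The function name `top_substitutes` and the toy vectors are hypothetical.

```python
import numpy as np

def top_substitutes(vecs, index, k=3):
    """Indices of the k medicines most similar to vecs[index], excluding itself."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit[index]   # cosine similarity to every medicine
    sims[index] = -np.inf       # never return the query medicine itself
    return np.argsort(sims)[::-1][:k]

# Toy embeddings for 4 medicines.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
substitutes = top_substitutes(vecs, index=0, k=2)
```

Here medicine 1 ranks first and medicine 3 second, because their vectors point in directions closest to that of medicine 0.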

AIM 3: Building a Price Prediction Model

Method 1: Linear Regression of Price on the combined textual description of the product, Active Ingredients, and Therapeutic Class.

Method 2: Random Forest Regression of Price on the combined textual description of the product, Active Ingredients, and Therapeutic Class.

Method 3: Building a Neural Network with a single hidden layer
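The three price models could be compared along the lines below. This is a sketch under stated assumptions: the features are synthetic, `MLPRegressor` with one 32-unit hidden layer is only one way to build a single-hidden-layer network, and the notebook may use a different framework or layer size.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the embedded features and the price column.
X = rng.normal(size=(300, 20))
price = 50.0 + 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # hidden_layer_sizes=(32,) means exactly one hidden layer of 32 units.
    "neural_net": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
```

Scoring all three models on the same held-out split keeps the comparison fair; on real data, scaling the inputs usually helps the neural network converge.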

AIM 1: Categorisation

Creating functions to calculate average embeddings.
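One possible shape for such a helper is sketched below. It assumes word vectors are available as a plain dictionary from word to NumPy array; the name `average_embedding`, the whitespace tokenisation, and the zero-vector fallback are illustrative choices, not the notebook's confirmed implementation.

```python
import numpy as np

def average_embedding(text, word_vectors, dim):
    """Mean of the vectors of all known words in text; zeros if none are known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional vocabulary.
word_vectors = {
    "pain": np.array([1.0, 0.0]),
    "relief": np.array([0.0, 1.0]),
}

vec = average_embedding("Pain relief tablet", word_vectors, dim=2)
empty = average_embedding("unknown words only", word_vectors, dim=2)
```

Words missing from the vocabulary (here "tablet") are skipped, so they do not pull the average toward zero.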

AIM 1: Method 1 - Cosine Similarity between Medicine combined description and Drug Category Description.

AIM 1: Method 2 - Performing K-Means Clustering. (Comparison with Drug Categories is done later.)

AIM 1: Method 3 - Topic Modeling

Analysing the medicines in each category according to the classification by description similarity.

AIM 1: Method 2 - Performing classification by comparing cluster values to drug category descriptions via 'uses' aggregation for each cluster.
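A minimal sketch of this cluster-to-category comparison: average the 'uses' embeddings within each cluster, then pick the category whose embedding is most similar. The function name `label_clusters` and the toy vectors are hypothetical.

```python
import numpy as np

def label_clusters(cluster_ids, uses_vecs, category_vecs):
    """Map each cluster id to the index of its most similar category."""
    cat_unit = category_vecs / np.linalg.norm(category_vecs, axis=1, keepdims=True)
    mapping = {}
    for cluster in np.unique(cluster_ids):
        # Aggregate the 'uses' embeddings of all medicines in this cluster.
        centroid = uses_vecs[cluster_ids == cluster].mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        mapping[int(cluster)] = int((cat_unit @ centroid).argmax())
    return mapping

# Toy data: 3 medicines in 2 clusters, 2 categories.
cluster_ids = np.array([0, 0, 1])
uses_vecs = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]])
category_vecs = np.array([[0.0, 1.0], [1.0, 0.0]])

cluster_to_category = label_clusters(cluster_ids, uses_vecs, category_vecs)
```

Every medicine then inherits the category of its cluster, so a mislabeled cluster mislabels all of its members at once.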

Dataset with results from all 3 Methods of Classification

Exploring difference/distance between drug category embeddings

The more common drug categories, e.g. Cold Cures, are much closer to most other categories than less common ones, e.g. Barbiturates. This also helps explain why the common categories have more medicines classified within them: because their embeddings are similar to those of all other categories, medicines that belong to other categories may be classified into them.
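How close a category sits to the rest can be measured as its mean Euclidean distance to all other category embeddings. The sketch below uses toy 2-dimensional vectors, and the function name is a placeholder.

```python
import numpy as np

def mean_distance_to_others(category_vecs):
    """Mean Euclidean distance from each category to every other category."""
    diffs = category_vecs[:, None, :] - category_vecs[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)  # full pairwise distance matrix
    n = len(category_vecs)
    # The diagonal is zero, so dividing the row sum by n - 1 averages over the others.
    return dist.sum(axis=1) / (n - 1)

category_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
closeness = mean_distance_to_others(category_vecs)
```

A low value marks a category that sits near the centre of the embedding space, which is the kind of category likely to attract medicines from elsewhere.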

From the histogram above, we notice that the average Euclidean distance between the embeddings of the 3 categories assigned to each medicine (one per method) is approximately normally distributed around 0.8, with a spike at 0-0.1. The spike implies that, for those medicines, the three methods returned extremely similar categories.
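The per-medicine agreement score can be sketched as the mean of the three pairwise distances between the category embeddings chosen by each method. The array names and toy values below are illustrative only.

```python
import numpy as np

def mean_pairwise_distance(a, b, c):
    """Average of the three pairwise Euclidean distances, row by row."""
    d_ab = np.linalg.norm(a - b, axis=1)
    d_ac = np.linalg.norm(a - c, axis=1)
    d_bc = np.linalg.norm(b - c, axis=1)
    return (d_ab + d_ac + d_bc) / 3.0

# Category embeddings picked by each method for 2 medicines.
method1 = np.array([[1.0, 1.0], [0.0, 0.0]])
method2 = np.array([[1.0, 1.0], [3.0, 4.0]])
method3 = np.array([[1.0, 1.0], [0.0, 0.0]])

agreement = mean_pairwise_distance(method1, method2, method3)
```

A score of exactly 0 means all three methods chose the same category, which is what the spike near 0 in the histogram corresponds to.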

Analysing the distribution of medicines by category for each of the three methods.

The graph above approximates the distribution of medicines by category.

None of the three methods assigned medicines to every Drug Category in its final classification results.

AIM 2: Price Modeling

Linear Regression

Random Forest

Neural Network

AIM 3: Substitution